Spatial-Aware Object Embeddings for Zero-Shot Localization and Classification of Actions
We aim for zero-shot localization and classification of human actions in
video. Where traditional approaches rely on global attribute or object
classification scores for their zero-shot knowledge transfer, our main
contribution is a spatial-aware object embedding. To arrive at spatial
awareness, we build our embedding on top of freely available actor and object
detectors. Relevance of objects is determined in a word embedding space and
further enforced with estimated spatial preferences. Besides local object
awareness, we also embed global object awareness into our embedding to maximize
actor and object interaction. Finally, we exploit the object positions and
sizes in the spatial-aware embedding to demonstrate a new spatio-temporal
action retrieval scenario with composite queries. Action localization and
classification experiments on four contemporary action video datasets support
our proposal. Apart from state-of-the-art results in the zero-shot localization
and classification settings, our spatial-aware embedding is even competitive
with recent supervised action localization alternatives.
Comment: ICC
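As a rough illustration of the zero-shot scoring idea described above, the sketch below ranks objects by cosine similarity to an unseen action name in a word embedding space and scores an actor detection by nearby, semantically relevant object detections. The function names, the spatial_prior callback, and the multiplicative combination are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two word vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def object_relevance(action_vec, object_vecs):
    """Rank objects by semantic relevance to an unseen action class.
    object_vecs: dict mapping object name -> word embedding."""
    return {name: cosine(action_vec, vec) for name, vec in object_vecs.items()}

def score_actor_box(actor_box, object_dets, relevance, spatial_prior):
    """Score an actor detection by nearby, semantically relevant objects.
    object_dets: list of (object_name, box, detector_confidence).
    spatial_prior(actor_box, obj_box): weight in [0, 1] for the relative layout."""
    score = 0.0
    for name, box, conf in object_dets:
        score += relevance.get(name, 0.0) * conf * spatial_prior(actor_box, box)
    return score
```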
Counting with Focus for Free
This paper aims to count arbitrary objects in images. The leading counting
approaches start from point annotations per object from which they construct
density maps. Then, their training objective transforms input images to density
maps through deep convolutional networks. We posit that the point annotations
serve more supervision purposes than just constructing density maps. We
introduce ways to repurpose the points for free. First, we propose supervised
focus from segmentation, where points are converted into binary maps. The
binary maps are combined with a network branch and accompanying loss function
to focus on areas of interest. Second, we propose supervised focus from global
density, where the ratio of point annotations to image pixels is used in
another branch to regularize the overall density estimation. To assist both the
density estimation and the focus from segmentation, we also introduce an
improved kernel size estimator for the point annotations. Experiments on six
datasets show that all our contributions reduce the counting error, regardless
of the base network, resulting in state-of-the-art accuracy using only a single
network. Finally, we are the first to count on WIDER FACE, allowing us to show
the benefits of our approach in handling varying object scales and crowding
levels. Code is available at
https://github.com/shizenglin/Counting-with-Focus-for-Free
Comment: ICCV, 201
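A minimal sketch of how point annotations can be repurposed into the three supervision signals mentioned above: a Gaussian density map, a binary focus (segmentation) map, and a global density ratio. The fixed sigma and focus_radius below are illustrative placeholders, not the paper's improved kernel size estimator.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def points_to_supervision(points, shape, sigma=4.0, focus_radius=8):
    """Turn point annotations into (density map, binary focus map, global density).
    points: list of (row, col); shape: (H, W)."""
    h, w = shape
    density = np.zeros((h, w), dtype=np.float32)
    for r, c in points:
        density[int(r), int(c)] += 1.0
    density = gaussian_filter(density, sigma)      # approximately sums to the object count

    focus = np.zeros((h, w), dtype=np.float32)
    rr, cc = np.mgrid[0:h, 0:w]
    for r, c in points:
        focus[(rr - r) ** 2 + (cc - c) ** 2 <= focus_radius ** 2] = 1.0

    global_density = len(points) / float(h * w)    # ratio of points to pixels
    return density, focus, global_density
```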
Infinite Class Mixup
Mixup is a widely adopted strategy for training deep networks, where
additional samples are augmented by interpolating inputs and labels of training
pairs. Mixup has been shown to improve classification performance, network
calibration, and out-of-distribution generalisation. While effective, a
cornerstone of Mixup, namely that networks learn linear behaviour patterns
between classes, is only indirectly enforced since the output interpolation is
performed at the probability level. This paper seeks to address this limitation
by mixing the classifiers directly instead of mixing the labels for each mixed
pair. We propose to define the target of each augmented sample as a uniquely
new classifier, whose parameters are a linear interpolation of the classifier
vectors of the input pair. The space of all possible classifiers is continuous
and spans all interpolations between classifier pairs. To make optimisation
tractable, we propose a dual-contrastive Infinite Class Mixup loss, where we
contrast the classifier of a mixed pair to both the classifiers and the
predicted outputs of other mixed pairs in a batch. Infinite Class Mixup is
generic in nature and applies to many variants of Mixup. Empirically, we show
that it outperforms standard Mixup and variants such as RegMixup and Remix on
balanced, long-tailed, and data-constrained benchmarks, highlighting its broad
applicability.
Comment: BMVC 202
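The classifier-mixing step can be sketched as follows, assuming a linear classifier whose weight matrix rows are the per-class vectors. This is a simplified, single-contrast version for illustration; the paper's dual-contrastive loss additionally contrasts each mixed pair against the predicted outputs of other mixed pairs in the batch.

```python
import torch
import torch.nn.functional as F

def infinite_class_mixup_loss(feats, labels, classifier_weight, alpha=1.0):
    """Sketch: mix classifiers instead of labels (simplified, single contrast).
    feats: (B, D) batch features; classifier_weight: (C, D) class vectors."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(feats.size(0))
    mixed_feats = lam * feats + (1 - lam) * feats[perm]            # mix inputs
    mixed_cls = lam * classifier_weight[labels] + \
                (1 - lam) * classifier_weight[labels[perm]]        # mix classifiers -> (B, D)
    # Contrast each mixed sample against the mixed classifiers of the batch;
    # its own mixed classifier is the positive.
    logits = mixed_feats @ mixed_cls.t()                           # (B, B)
    targets = torch.arange(feats.size(0))
    return F.cross_entropy(logits, targets)
```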
Localizing Actions from Video Labels and Pseudo-Annotations
The goal of this paper is to determine the spatio-temporal location of
actions in video. Where training from hard-to-obtain box annotations is the
norm, we propose an intuitive and effective algorithm that localizes actions
from their class label only. We are inspired by recent work showing that
unsupervised action proposals selected with human point-supervision perform as
well as using expensive box annotations. Rather than asking users to provide
point supervision, we propose fully automatic visual cues that replace manual
point annotations. We call the cues pseudo-annotations, introduce five of them,
and propose a correlation metric for automatically selecting and combining
them. Thorough evaluation on challenging action localization datasets shows
that we reach results comparable to results with full box supervision. We also
show that pseudo-annotations can be leveraged during testing to improve weakly-
and strongly-supervised localizers.
Comment: BMV
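To make the selection concrete, the sketch below scores action proposals by how well they cover automatically generated point cues, with optional per-cue weights. The coverage ratio and the equal default weights are assumptions for illustration; the paper derives its combination from a correlation metric over the pseudo-annotations.

```python
import numpy as np

def point_in_box(point, box):
    """point: (x, y); box: (x1, y1, x2, y2)."""
    x, y = point
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2

def score_proposals(proposals, cue_points, cue_weights=None):
    """Rank action proposals by how well they cover automatic point cues.
    proposals: list of boxes; cue_points: dict cue_name -> list of (x, y) points.
    cue_weights: optional dict cue_name -> weight; equal weights by default."""
    if cue_weights is None:
        cue_weights = {name: 1.0 for name in cue_points}
    scores = []
    for box in proposals:
        s = 0.0
        for name, pts in cue_points.items():
            hits = sum(point_in_box(p, box) for p in pts)
            s += cue_weights[name] * hits / max(len(pts), 1)
        scores.append(s)
    return np.asarray(scores)
```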
No Spare Parts: Sharing Part Detectors for Image Categorization
This work aims for image categorization using a representation of distinctive
parts. Different from existing part-based work, we argue that parts are
naturally shared between image categories and should be modeled as such. We
motivate our approach with a quantitative and qualitative analysis by
backtracking where selected parts come from. Our analysis shows that in
addition to the category parts defining the class, the parts coming from the
background context and parts from other image categories improve categorization
performance. Part selection should not be done separately for each category,
but instead be shared and optimized over all categories. To incorporate part
sharing between categories, we present an algorithm based on AdaBoost to
jointly optimize part sharing and selection, as well as fusion with the global
image representation. We achieve results competitive to the state-of-the-art on
object, scene, and action categories, further improving over deep convolutional
neural networks.
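A schematic of the joint selection idea: each boosting round picks the single part detector that most reduces the weighted one-vs-rest error summed over all categories, so a part that helps several categories is naturally favoured. This is a heavily simplified sketch (median thresholds, no per-round classifier weights), not the paper's AdaBoost formulation.

```python
import numpy as np

def select_shared_parts(part_responses, labels, num_rounds=10):
    """Greedy, boosting-style selection of part detectors shared over categories.
    part_responses: (N, P) detector scores for N images and P candidate parts.
    labels: (N,) category labels."""
    n, p = part_responses.shape
    classes = np.unique(labels)
    weights = np.full((len(classes), n), 1.0 / n)   # one weight vector per class
    selected = []
    for _ in range(num_rounds):
        errs = np.zeros(p)
        for j in range(p):
            pred = part_responses[:, j] > np.median(part_responses[:, j])
            for k, c in enumerate(classes):
                errs[j] += np.sum(weights[k] * (pred != (labels == c)))
        best = int(np.argmin(errs))
        selected.append(best)
        # Re-weight: emphasise images the chosen shared part still gets wrong.
        pred = part_responses[:, best] > np.median(part_responses[:, best])
        for k, c in enumerate(classes):
            weights[k] *= np.exp((pred != (labels == c)).astype(float))
            weights[k] /= weights[k].sum()
    return selected
```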
4-Connected Shift Residual Networks
The shift operation was recently introduced as an alternative to spatial
convolutions. The operation moves subsets of activations horizontally and/or
vertically. Spatial convolutions are then replaced with shift operations
followed by point-wise convolutions, significantly reducing computational
costs. In this work, we investigate how shifts should best be applied to
high-accuracy CNNs. We apply shifts of two different neighbourhood groups to ResNet
on ImageNet: the originally introduced 8-connected (8C) neighbourhood shift and
the less well studied 4-connected (4C) neighbourhood shift. We find that when
replacing ResNet's spatial convolutions with shifts, both shift neighbourhoods
give equal ImageNet accuracy, showing the sufficiency of small neighbourhoods
for large images. Interestingly, when incorporating shifts to all point-wise
convolutions in residual networks, 4-connected shifts outperform 8-connected
shifts. Such a 4-connected shift setup gives the same accuracy as full residual
networks while reducing the number of parameters and FLOPs by over 40%. We then
highlight that without spatial convolutions, ResNet's downsampling/upsampling
bottleneck channel structure is no longer needed. We show a new, 4C shift-based
residual network, much shorter than the original ResNet yet with a higher
accuracy for the same computational cost. This network is the highest accuracy
shift-based network yet shown, demonstrating the potential of shifting in deep
neural networks.
Comment: ICCV Neural Architects Workshop 201
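The shift operation itself is simple to write down. Below is a sketch of a 4-connected shift in PyTorch: four channel groups move one pixel up, down, left, and right with zero padding, a fifth group stays in place, and a point-wise (1x1) convolution follows. The five-way equal split is an illustrative choice, not necessarily the grouping used in the paper.

```python
import torch

def shift4(x):
    """4-connected shift over channel groups with zero padding at the borders.
    x: (B, C, H, W) activation tensor."""
    b, c, h, w = x.shape
    out = torch.zeros_like(x)
    g = c // 5                                        # channels per direction group
    out[:, 0*g:1*g, :-1, :] = x[:, 0*g:1*g, 1:, :]    # shift up
    out[:, 1*g:2*g, 1:, :]  = x[:, 1*g:2*g, :-1, :]   # shift down
    out[:, 2*g:3*g, :, :-1] = x[:, 2*g:3*g, :, 1:]    # shift left
    out[:, 3*g:4*g, :, 1:]  = x[:, 3*g:4*g, :, :-1]   # shift right
    out[:, 4*g:]            = x[:, 4*g:]              # identity group
    return out

# Usage sketch: replace a 3x3 spatial convolution with shift4 followed by a
# point-wise convolution, e.g. torch.nn.Conv2d(c, c, kernel_size=1).
```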